OpenStreetMap - Data Wrangling with MongoDB

Artem Oppermann

Map Area: Boston, Massachusetts, USA

1. Problems Encountered in the Map

1.2 Inaccurate Street Names

In the first part of the audit I take care of the street names. I print out all the street types and the corresponding street names in the dataset to see if there are any inaccurate or unstandardized entrys. Here is the overview over the street types and corresponding street names.

'Ave': {

     'Boston Ave',
     'College Ave',
     'Concord Ave',
     'Francesca Ave',
     'Highland Ave',
     'Josephine Ave',
     'Lexington Ave',
     'Massachusetts Ave',
     'Morrison Ave',
     'Somerville Ave',
     'Western Ave',
     'Willow Ave'},

'Ave.': {

     'Massachusetts Ave.',
     'Somerville Ave.'},

'Avenue': {

     'Aberdeen Avenue',
     'Boston Avenue',
     'Burnside Avenue',
     'College Avenue',
     'Concord Avenue',
     'Cottage Avenue',
     'Divinity Avenue',
     'Eastern Avenue',
     'Fisk Avenue',
     'Highland Avenue',
     'Huron Avenue',
     'Lexington Avenue',
     'Lexington Avenue',
     'Maple Avenue',
     'Massachusetts Avenue',
     'Massachussetts Avenue',
     'Myrtle Avenue',
     'Nichols Avenue',
     'Notre Dame Avenue',
     'Packard Avenue',
     'Prichard Avenue',
     'Putnam Avenue',
     'Rindge Avenue',
     'Riverside Avenue',
     'Somerville Avenue',
     'Spencer Avenue',
     'Spring Avenue',
     'Sunnyside Avenue',
     'Western Avenue',
     'Whittemore Avenue',
     'Willow Avenue',
     'rindge Avenue',
     'somerville Avenue'},

'Broadway': {

     'Broadway'},

'Circle': {

     'Norcross Circle'
     },

'Ct': {

     'Kelley Ct'},

'Drive': {

     'Bishop Allen Drive',
     'CambridgePark Drive',
     'George P Hassett Drive',
     'Memorial Drive'},

'Elm': {

     'Elm'},

'Holland': {

     'Holland'},

'Lane': {

     'Pond Lane'},

'Park': {

     'Acorn Park',
     'Austin Park',
     'Exeter Park',
     'Granton Park',
     'Monument Park'},

'Parkway': {

     'Alewife Brook Parkway', 
     'Mystic Valley Parkway'},

'Pkwy': {

     'Birmingham Pkwy'},

'Place': {

     'Drummond Place',
     'Hancock Place',
     'Hillside Place',
     'Pleasant Place',
      'Sacramento Place'},

'Rd': {

     'Abby Rd', 
     'Soldiers Field Rd', 
     'Bristol Rd', 
     'Aberdeen Rd'},

'Road': {

     'Bay State Road',
     'Benton Road',
     'Berwick Road',
     'Bristol Road',
     'Coolidge Hill Road',
     'Dana Road',
     'Dearborn Road',
     'Dunbarton Road',
     'Ellington Road',
     'Eric Road',
     'Gordon Road',
     'Heath Road',
     'Juniper Road',
     'Lower Campus Road',
     'Matignon Road',
     'Menotomy Road',
     'Newport Road',
     'Trapelo Road',
     'University Road'},

'Row': {

     'Professors Row'},

'Square': {

     'Freedom Square', 
     'Davis Square'},

'St': {

     'Antwerp St',
     'Arsenal St',
     'Athol St',
     'Bagnal St',
     'Brentwood St',
     'Court St',
     'Duval St',
     'Elm St',
     'Everett St',
     'George St',
     'Holton St',
     'Kirkland St',
     'Litchfield St',
     'Lothrop St',
     'Mackin St',
     'Medford St',
     'Merrill St',
     'Mt Auburn St',
     'Portsmouth St',
     'Richardson St',
     'South Waverly St',
     'Ware St',
     'Waverly St',
     'Winthrop St'},

'St.': {

     'Prospect St.', 
     'Main St.', 
     'Pearl St.', 
     'Albion St.', 
     'Banks St.'},

'Street': {

     'Adamson Street',
     'Agassiz Street',
     'Akron Street',
     'Albany Street',
     'Albion Street',
     'Alcott Street',
     'Aldie Street',
     'Allen Street',
     'Alpine Street',
     'Alton Street',
     'Arlington Street',
     'Arrow Street',
     'Arsenal Street',
     'Auburn Street',
     'Beacon Street',
     'Belmont Street',
     'Blackstone Street',
     'Bond Street',
     'Bow Street',
     'Brattle Street',
     'Brentwood Street',
     'Brighton Street',
     'Brimmer Street',
     'Brookline Street',
     'Cambridge Street',
     'Carver Street',
     'Cedar Street',
     'Centre Street',
     'Cherry Street',
     'Chester Street',
     'Church Street',
     'Clark Street',
     'Colerain Street',
     'Common Street',
     'Cottage Street',
     'Cowperthwaite Street',
     'Curtis Street',
     'Cushing Street',
     'Dane Street',
     'DeWolfe Street',
     'Devereaux Street',
     'Dudley Street',
     'Dunster Street',
     'Eatonia Street',
     'Elm Street',
     'Erie Street',
     'Everett Street',
     'Fawcett Street',
     'Fayette Street',
     'Franklin Street',
     'Fremont Street',
     'Garden Street',
     'Gardner Street',
     'Granite Street',
     'Green Street',
     'Grove Street',
     'Guest Street',
     'Harvard Street',
     'High Street',
     'Hilton Street',
     'Holland Street',
     'Holton Street',
     'Holyoke Street',
     'Howard Street',
     'Hudson Street',
     'Ibbetson Street',
     'Inman Street',
     'Irving Street',
     'James Street',
     'Jay Street',
     'Kinnaird Street',
     'Lake Street',
     'Laurel Street',
     'Lexington Street',
     'Lincoln Street',
     'Linnaean Street',
     'Linwood Street',
     'Lowell Street',
     'Magazine Street',
     'Main Street',
     'Mansfield Street',
     'Marrigan Street',
     'Martin Street',
     'Mason Street',
     'Medford Street',
     'Mellen Street',
     'Michael Street',
     'Mill Street',
     'Milton Street',
     'Moulton Street',
     'Mount Auburn Street',
     'Nonantum Street',
     'Norcross Street',
     'Norris Street',
     'North Beacon Street',
     'North Harvard Street',
     'North Street',
     'North Union Street',
     'Oliver Street',
     'Orchard Street',
     'Otis Street',
     'Oxford Street',
     'Palmer Street',
     'Park Street',
     'Parker Street',
     'Pearl Street',
     'Pemberton Street',
     'Pitman Street',
     'Pleasant Street',
     'Plympton Street',
     'Princeton Street',
     'Prospect Street',
     'Quincy Street',
     'Remington Street',
     'Renfrew Street',
     'Richardson Street',
     'River Street',
     'Sacramento Street',
     'Saint Paul Street',
     'Salem Street',
     'School Street',
     'Shepard Street',
     'Sherman Street',
     'Sidney Street',
     'Silk Street',
     'Spring Street',
     'Summer Street',
     'Temple Street',
     'Travis Street',
     'Tudor Street',
     'Tyler Street',
     'Upton Street',
     'Vernon Street',
     'Vinal Street',
     'Ware Street',
     'Washburn Street',
     'Washington Street',
     'Watertown Street',
     'Waverly Street',
     'Webster Street',
     'White Street',
     'William Street',
     'Wilton Street',
     'Windom Street',
     'Winslow Street',
     'Winter Street',
     'Winthrop Street',
     'Wyman Street',
     'mount Auburn Street'},

'Terrace': {

      'Alberta Terrace'},

'Way': {

      'Memorial Way'}

}

The audit revealed some abbreviations for the street types, suchs as:

  • Ave
  • Ave.
  • Rd
  • St
  • St.
  • Pkwy

To preserve the consistency I updated the substings in the abbreviated street names according to this mapping:

  • Ave: Avenue,
  • Ave.: Avenue,
  • Rd: Road,
  • St: Street,
  • St.: Street,
  • Pkwy: Parkway,

Also some street names such as "mount Auburn Street" began with lowercase. I updated the first letters in these names to be in uppercase.

1.2 Inaccurate Postal Codes

In the second part of the audit I focus on the postal codes. Here is the overview over all postal codes in the dataset and their counts:

01238: 1
01240: 1
01944: 1
02115: 1
02128: 15
02134: 35
02134-1305: 9
02134-1306: 2
02134-1307: 29
02134-1311: 4
02134-1312: 2
02134-1313: 4
02134-1316: 3
02134-1317: 4
02134-1318: 2
02134-1319: 5
02134-1321: 4
02134-1322: 6
02134-1327: 1
02134-1409: 4
02134-1420: 9
02134-1433: 11
02134-1442: 5
02135: 240
02138: 67
02138-1901: 1
02138-2701: 2
02138-2706: 2
02138-2724: 1
02138-2742: 1
02138-2762: 1
02138-2763: 1
02138-2801: 4
02138-2901: 3
02138-2903: 5
02138-2933: 3
02138-3003: 1
02138-3824: 1
02139: 277
02140: 33
02140-1340: 1
02140-2215: 1
02142: 1
02143: 40
02144: 117
02145: 1
02155: 37
02174: 1
02458: 1
02472: 32
02474: 132
02474-8735: 1
02476: 9
02478: 11

Zipcodes in Boston starts all with 02 and are five digits long. However the dataset contains three zipcodes that starts with 01. It appears beeing a mistake by the editor of the OSM data because these zipcodes belong to the neighbor area of Boston. Furthermore there are zipcodes that have also some trailing numbers, such as:

  • 02134-1433

  • 02134-1442

  • 02138-1901

Because I wanted the zipcodes to be all consistent five digits long numbers I deleted these trailing numbers. For example 02134-1433 became 02134.

1.3 Inaccurate Cuisine Types

In this subsection I take care of the cuisines types in the dataset. Here is the overview over all cuisines there are in the dataset and their count:

american: 12
argentinian: 1
asian: 4
asian;vegan: 1
bagel: 1
Bakery: 1
barbecue: 1
burger: 6
chinese: 15
coffee_shop: 16
diner: 1
donut: 4
eritrean: 1
falafel: 1
french: 1
greek: 3
Greek-American: 1
ice_cream: 4
indian: 12
international: 1
italian: 10
italian_pizza;italian: 1
japanese: 1
kebab: 2
korean: 1
korean;barbecue: 1
mediterranean: 3
mexican: 14
Modern American: 1
persian: 1
pizza: 25
pizza;burger;sandwich: 1
pizza;seafood;international: 1
pub: 2
ramen: 2
regional: 4
sandwich: 9
sandwich;coffee_shop: 1
seafood: 1
soul: 1
steakhouse: 1
sushi: 2
thai: 8
tibetan: 1
turkish: 3
vegetarian: 1

The audit revealed several basic issues with the cuisines types:

  • all types begin with lowercase
  • special characters in the name such as '_' in 'ice_cream'
  • usage if ';' instead of ','

The issues are solved by updating all first letters of each name into uppercase, replacing '_' with space ' ' and exchanging ';' with ','.

1.4 Inaccurate Amenity Types

In the next part of the audit I take a look on the amenity types in the dataset. Here is the overview over all amenity types and their count:

arts_centre: 4
atm: 10
bakery: 1
bank: 41
bar: 6
bbq: 2
bench: 251
bicycle_parking: 116
bicycle_rental: 35
bicycle_repair_station: 11
books: 1
bus_station: 2
cafe: 58
car_rental: 1
car_sharing: 17
car_wash: 3
cinema: 5
clinic: 4
club_house: 1
college: 1
community_centre: 2
Conservation Land: 1
coworking_space: 1
dentist: 6
doctors: 5
dojo: 1
drinking_water: 15
embassy: 1
fast_food: 47
fire_station: 23
fountain: 6
fuel: 33
grave_yard: 2
hospital: 9
ice_cream: 1
kindergarten: 2
library: 72
magazine_box: 3
marketplace: 1
mortuary: 1
nightclub: 1
parking: 243
parking_entrance: 1
pharmacy: 15
place_of_worship: 110
plumber: 1
police: 9
post_box: 45
post_office: 14
pub: 16
public_building: 5
recycling: 9
restaurant: 180
school: 112
shelter: 1
social_facility: 5
studio: 1
swimming_pool: 2
taxi: 1
telephone: 8
theatre: 8
toilets: 3
townhall: 7
university: 51
vending_machine: 1
veterinary: 2
waste_basket: 38

The issues with amenity names are pretty much the same as they were in previous subsection:

  • first letters are in lowercase
  • special characters in the name such as '_' in 'public_building'

I solve these issues in the same way as in case of cuisine types.

1.5 Inaccurate House Numbers

In the case of house numbers there are only very few minor problems. All housenumbers are digits except for two cases where the numbers are string 'One' and 'Zero'. I replaced these string with digits 1 and 0.

2. Data Overview

The section contains some basic information about the dataset and the MongoDB queries I used to obtain the information.

File Sizes

  • boston.osm - 75,693 MB
  • osm_boston.json - 113,808 MB

Number of Documents

collection.find().count()
418955

Number of Nodes

collection.find({"type":"node"}).count()
363480

Number of Ways

collection.find({"type":"way"}).count()
55475

Number of Unique Users

collection.distinct("basic_information.user") 519

Top 3 Contributing User

pipeline=[{"$group": {"_id":"$basic_information.user", "count":{"$sum":1}}},{"$sort":{"count":-1}}, {"$limit":3}]

collection.aggregate(pipeline)

  • user: 'crschmidt', number of contributions: 228719}
  • user: 'jremillard-massgis', number of contributions: 63784}
  • user: 'ingalls_imports', number of contributions: 32453}

Number of Users with least Contribution

pipeline=[{"$group": {"_id":"$basic_information.user", "count":{"$sum":1}}},{"$group":{"_id":"$count", "num_users":{"$sum":1}}},{"$sort":{"_id":+1}},{"$limit":3}]

collection.aggregate(pipeline)

  • Count of contributions: 1, number of users: 136
  • Count of contributions: 2, number of users: 57
  • Count of contributions: 3, number of users: 29

3. Further Data Exploration using MongoDB Queries

Count of most common restaurant kinds

pipeline=[{"$match":{"amenity":{"$exists":1}, "amenity.amenity-type":"Restaurant"}}, {"$group":{"_id":"$amenity.cuisine", "count":{"$sum":1}}},{"$sort":{"count":-1}},{"$limit":4}]

collection.aggregate(pipeline)

  • 'Chinese', 'count': 15
  • 'Pizza', 'count': 14
  • 'Indian', 'count': 12

Count of most common amenities in the dataset

pipeline=[{"$match":{"amenity":{"$exists":1}}}, {"$group":{"_id":"$amenity.amenity-type","count":{"$sum":1}}},{"$sort":{"count":-1}},{"$limit":10}

collection.aggregate(pipeline)

  • 'Bench', 'count': 251
  • 'Parking', 'count': 243
  • 'Restaurant', 'count': 180
  • 'Bicycle Parking', 'count': 116
  • 'School', 'count': 112
  • 'Place Of Worship', 'count': 110
  • 'Library', 'count': 72
  • 'Cafe', 'count': 58
  • 'University', 'count': 51

Count of three most common banks in the area

pipeline=[{"$match":{"amenity":{"$exists":1}, "amenity.amenity-type":"Bank"}}, {"$group":{"_id":"$amenity.name", "count":{"$sum":1}}},{"$sort":{"count":-1}},{"$limit":4}]

collection.aggregate(pipeline)

  • 'Bank of America', 'count': 7
  • 'Cambridge Savings Bank', 'count': 4
  • 'Citizens Bank', 'count': 4
  • 'Eastern Bank', 'count': 2

Hospitals in the area

pipeline=[{"$match":{"amenity":{"$exists":1}, "amenity.amenity-type":"Hospital"}}, {"$group":{"_id":"$amenity.name"}}]

collection.aggregate(pipeline)

  • 'Riverside Health Center'
  • 'Heritage Hospital'
  • 'Central Street Health Center'
  • 'Central Hospital'
  • 'Lahey Hitchcock Clinic'
  • 'McLean Hospital'
  • 'Somerville Central Hospital'
  • 'Senior Citizens Health Center'
  • 'Mclean Hospital'

4. Additional Ideas for Improvement

If we take a look on the least ten numbers of users with least contributions according to:

pipeline=[{"$group": {"_id":"$basic_information.user", "count":{"$sum":1}}},{"$group":{"_id":"$count", "num_users":{"$sum":1}}},{"$sort":{"_id":+1}},{"$limit":25}]

collection.aggregate(pipeline)

We get:

  • Count of contributions: 1, number of users: 136
  • Count of contributions: 2, number of users: 57
  • Count of contributions: 3, number of users: 29
  • Count of contributions: 4, number of users: 25
  • Count of contributions: 5, number of users: 22
  • Count of contributions: 6, number of users: 14
  • Count of contributions: 7, number of users: 13
  • Count of contributions: 8, number of users: 6
  • Count of contributions: 9, number of users: 11
  • Count of contributions: 10, number of users: 6

It is apparent that the number of contributions is very skewed. Higher number of contributions results in fewer amount of users. According to the querys from subsection 3. only the top contributor 'crschmidt' is responsible for 228719 out of 418955 documents, which corresponds to 54,4 % of all contributions.

The participation of people in creating OSM content could be increased through an implementation of gamification techniques. For instance a leaderbord of top contributors on the OSM page could motivate people to compete with others and do more contributions to OSM.

A problem with this technique would be the fact that people who just started their contributions could never reach the top rankings, because there are already contributors like 'crschmidt' who have a massive headstart.


In [ ]: